Comparisons of DNA and protein sequences between humans and model organisms, including the yeast 
Saccharomyces cerevisiae, the nematode Caenorhabditis elegans, and the fruit fly Drosophila melanogaster, are a significant 
source of information about the function of human genes and proteins in both normal and disease states. 
Important questions regarding cross-species sequence comparison remain unanswered, including (1) the fraction 
of the metabolic, signaling, and regulatory pathways that is shared by humans and the various model organisms; 
and (2) the validity of functional inferences based on sequence homology. We addressed these questions by 
analyzing the available fractions of human, fly, nematode, and yeast genomes for orthologous protein-coding 
genes, applying strict criteria to distinguish between candidate orthologous and paralogous proteins. Forty-two 
quartets of proteins could be identified as candidate orthologs. Twenty-four Drosophila protein sequences were 
more similar to their human orthologs than the corresponding nematode proteins. Analysis of sequence 
substitutions and evolutionary distances in this data set revealed that most C. elegans genes are evolving more 
rapidly than Drosophila genes, suggesting that unequal evolutionary rates may contribute to the differences in 
similarity to human protein sequences. The available fraction of Drosophila proteins appears to lack 
representatives of many protein families and domains, reflecting the relative paucity of genomic data from this 
species. 



Similarities between novel protein sequences and 
their better-characterized counterparts in sequence 
databases are an increasingly important source of 
hypotheses concerning protein functions. Particular 
attention has been paid to identifying homologs of 
medically relevant human proteins in genetically 
tractable model organisms, such as mice, the fruit 
fly Drosophila melanogaster, the nematode Caenorhabditis 
elegans, the yeast Saccharomyces cerevisiae, 
and bacteria (Banfi et al. 1996; Bassett et al. 
1997; Mushegian et al. 1997). Whole-genome com- 
parisons of microbial proteins (Koonin et al. 1997) 
have emphasized the importance of distinguishing 
orthologs, that is, proteins in two species that have 
evolved by vertical descent from a common ances- 
tor and are presumed to have the same function 
(Fitch 1970), from paralogs, namely proteins derived 
from lineage-specific duplication and domain 
shuffling that hence may have more divergent func- 
tions. Failure to resolve orthologs and paralogs can 
lead to misinterpretation of cellular biochemistry 
(Tatusov et al. 1996; Henikoff et al. 1997) and inac- 
curacies in molecular evolutionary reconstructions 
(Doolittle et al. 1996; Feng et al. 1997). This distinc- 
tion has been addressed in a protein domain analy- 
sis of positionally cloned human genes that are mu- 
tated in specific diseases ("disease genes") and their 
counterparts in the yeast and nematode genomes 
(Mushegian et al. 1997). Although almost equal 
fractions of the human disease genes had regions of 
significant similarity to nematode and to yeast pro- 
teins, the latter study identified a true ortholog in 
the complete yeast proteome for only 20% of hu- 
man proteins. In contrast, 30% of human disease 
genes had candidate orthologs in the ~50% completed 
nematode proteome then available. 
Drosophila and C. elegans have emerged as attractive 
model animal systems for studying human 
gene pathways because of their genetic tractability, 
phenotypically well-characterized genes, and 
progress in whole-genome sequencing (Rubin 1996; 
Ahringer 1997). Traditional morphology-based phy- 
logenies have placed C. elegans and other nema- 
todes in a basal metazoan clade composed of pseu- 
docoelomate animals, whereas Drosophila and other 
arthropods have been placed in the more recently 
derived protostome clade of eucoelomate animals 
with a shorter phylogenetic distance to vertebrates 
and other deuterostomes (for review, see Brusca and 
Brusca 1989). However, the notion that arthropods 
belong to a later evolutionary branch than nema- 
todes (e.g., Sidow and Thomas 1994) has been chal- 
lenged by recent studies based on analysis of mor- 
phology (Nielsen 1995), of ribosomal RNA se- 
quences (Aguinaldo et al. 1997), and of selected 
protein sequences (McHugh 1997). Notably, the es- 
timated sizes of the Drosophila and C. elegans ge- 
nomes and proteomes are quite similar, on the order 
of 100 Mb of genomic DNA and 15,000 genes (Miklos 
and Rubin 1996; Waterston 1997). Despite this 
information, the representation of proteins and 
conserved protein domains in these two proteomes 
has not been approached systematically. 

The present study was designed to identify a 
substantial set of orthologous protein-coding genes 
in the eukaryotic model organisms by using strict 
criteria to define orthologous candidates. We then 
analyzed this set of proteins to assess the relative 
similarity of Drosophila and C. elegans proteins to 
their human orthologs. As a complementary ap- 
proach toward the evaluation of the model organ- 
isms, we sought to estimate the fractions of con- 
served domains and to compare the composition of 
multidomain proteins in the available protein sets 
of Drosophila and C. elegans. 



RESULTS AND DISCUSSION 

Forty-Two Quartets of Candidate Orthologs 

To identify all potential orthologous genes in hu- 
mans, nematode, fly, and yeast, we first searched 
the complete S. cerevisiae proteome (6141 protein 
sequences) using as queries identified Drosophila 
proteins (2142 available sequences as of March 1, 
1997, then excluding peptides shorter than 110 
amino acids). The modified BLATAX program was 
used to tabulate the highest scoring matches for 
each Drosophila protein. The resultant 848 fly pro- 
teins were then used to search the nonredundant 



protein sequence (NR) database and extract the best 
matches from humans. The human sequences re- 
trieved in this way were examined to remove in- 
complete proteins and sequences that occurred 
more than once as a result of being the best match 
for two or more Drosophila proteins. The remaining 
set of 480 human proteins was used to search the 
NR database. The best matches to these human proteins 
from Drosophila, C. elegans, and S. cerevisiae 
were then extracted by the modified BLATAX pro- 
gram using the similarity score measure. Next, we 
removed low-scoring sets and spurious hits repre- 
senting matches in low-sequence complexity and 
coiled-coil segments. The resultant set of proteins 
was filtered to derive candidate orthologs by exclud- 
ing (1) sequences for which the yeast sequence was 
closer to a human homolog than either a fly or 
nematode sequence (the third criterion of orthology; 
see Methods); and (2) proteins that shared high 
similarity in one domain but differed in overall do- 
main architecture from the human protein (the sec- 
ond criterion of orthology). We then excluded 
members of expanded protein families as listed in 
Methods. The remaining protein quartets were subjected 
to a final round of filtering based on reciprocal 
BLAST2 searches (the first criterion of orthology). 

The resulting set of proteins contained 42 quartets 
of human, Drosophila, C. elegans, and S. cerevisiae 
candidate orthologs (Table 1). The 42 proteins 
comprising the data set of orthologs varied extensively 
in length; for example, the human proteins 
contained 116-2225 amino acids, with a median of 
349 residues. This is a significantly broader spec- 
trum of sizes than the set of 64 enzymes (151-935 
residues) used in a recent large-scale phylogenetic 
comparison (Doolittle et al. 1996). Moreover, the 
present set of candidate orthologs samples many of 
the functional categories essential in the eukaryotic 
cell, including genome replication and expression, 
organelle structural components, and signal trans- 
duction (Table 1). 

Different Proteins Generate Different Phylogenetic Tree Topologies 

Most C. elegans proteins in the databases, and 33 
of 42 nematode proteins in the current data set, are predicted 
from the genomic sequence, whereas all 42 of 
the Drosophila orthologs were derived from full- 
length cDNA sequences. Therefore, additional mea- 
sures were taken to verify the orthologous candi- 
dates. First, the human protein sequences were used 
as queries to search the database of unfinished 
Table 1. Forty-Two Quartets of Orthologous Proteins 
Length in amino acids is given for the human ortholog. (See Fig. 1 for the three possible tree 
topologies A, B, and C.) (ND) Not determined because yeast-human % identity <35% (see 
Methods). (*) Nematode SERCA ortholog sequence was determined manually from unfinished 
sequence of cosmid K11D9. (#) Number. The list of entries including rodent and bacterial 
orthologs is available at http://www.sequana.com/publications/model_proteomes. 



nematode DNA for possible 
missed exons. Second, the 
EST databases were searched 
for the higher scoring se- 
quences in the nematode 
and fly. Only one nematode 
sequence (synaptobrevin) 
was found among the ESTs 
that was a better ortholo- 
gous candidate than the se- 
quence in the NR database, 
and was included in the 
analysis. 

The possible relation- 
ships of the human, nema- 
tode, and fly sequences can 
be described by three differ- 
ent tree topologies (Fig. 1): 
(1) tree A, in which the fly 
sequence is a sister taxon to 
the human sequence with 
the nematode sequence 
basal to the fly-human 
clade; (2) tree B, wherein the 
fly and nematode sequences 
are sister taxa; and (3) tree C, 
in which the nematode and 
human sequences are sister 
taxa with the fly sequence 
basal to the nematode-hu- 
man clade. 

Thirty-six protein quar- 
tets whose metazoan mem- 
bers contained amino acid 
identities of ≥35% to the 
yeast sequence were sub- 
jected to individual phyloge- 
netic analysis as described in 
Methods. Neighbor-joining 
analyses of the 36 orthologous 
protein sequence alignments 
(Fig. 2) revealed that 
24 quartets generate tree A 
(with average bootstrap values 
of 80% ± 18 S.D.), 11 
amino acid alignments support 
tree B (average bootstrap 
values of 61% ± 15), 
and 1 alignment produced 
tree C with a bootstrap value 
of 45. Results were essen- 
tially identical when gam- 
ma-corrected distances were 
used. These data are consis- 



Figure 1 The three possible topologies for a tree describing 
the evolutionary relationships between nematodes, 
arthropods, and humans. Tree A (blue) reflects 
the conventional interpretation of metazoan phylogeny 
with nematodes as a "protocoelomate" group 
basal to arthropods and humans. Tree A was supported 
by neighbor-joining analysis of 24 protein quartets as 
described in the text. Tree B (red) represents the "Ecdysozoa" 
phylogeny derived from 18S rRNA gene sequences 
of a variety of nematodes and arthropods 
(Aguinaldo et al. 1997), and is supported by 11 protein 
quartets. Tree C (green) is not expected from any 
metazoan phylogenetic hypothesis and is supported 
by a single protein quartet. Average bootstrap values 
and their standard deviations are shown for each tree. 



tent with the distribution of the pairwise similarity 
scores observed in BLAST searches. Eight of the nine 
C. elegans protein sequences derived from full- 
length cDNA were less similar to their human or- 
thologs than the corresponding Drosophila protein 
sequences (i.e., supported tree A), suggesting that 
the prevalence of Tree A was not an artifact of com- 
putational prediction of nematode proteins. 
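
The three topologies can be told apart from pairwise distances alone. The sketch below is a simplification and not the neighbor-joining procedure used in this study: it applies the four-point condition to a single quartet, treating yeast as the outgroup. The function name, the taxon labels, and the example distances are hypothetical and chosen only for illustration.

```python
def quartet_topology(d):
    """Classify a human/fly/nematode/yeast quartet as tree A, B, or C.

    d maps frozenset({taxon1, taxon2}) to a pairwise evolutionary distance.
    Under the four-point condition, the pairing of taxa with the smallest
    summed distance corresponds to the internal split of the tree.
    """
    def dist(a, b):
        return d[frozenset((a, b))]

    sums = {
        # Tree A: fly and human are sisters, nematode is basal.
        "A": dist("human", "fly") + dist("nematode", "yeast"),
        # Tree B: fly and nematode are sisters.
        "B": dist("fly", "nematode") + dist("human", "yeast"),
        # Tree C: nematode and human are sisters, fly is basal.
        "C": dist("human", "nematode") + dist("fly", "yeast"),
    }
    return min(sums, key=sums.get)


# Hypothetical additive distances generated from a tree of type A.
example = {
    frozenset(("human", "fly")): 0.3,
    frozenset(("nematode", "yeast")): 0.9,
    frozenset(("human", "nematode")): 0.6,
    frozenset(("human", "yeast")): 0.7,
    frozenset(("fly", "nematode")): 0.7,
    frozenset(("fly", "yeast")): 0.8,
}
print(quartet_topology(example))
```

A long nematode branch inflates both the human-nematode and the fly-nematode distances, which is one way unequal rates could shift a quartet from tree B toward tree A under a distance criterion of this kind.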

Similarity of Fly and Nematode Proteins to Human 
Orthologs May Be Influenced by Unequal 
Evolutionary Rate Effects 

Phylogenetic hypotheses based on molecular se- 
quence data are affected by two important factors 
governing evolutionary rate. The first factor is gene-to-gene 
variation, where different genes have different 
evolutionary rates among a given pair of taxa, 



usually as a result of functional constraints on the 
encoded protein. The second factor is evolutionary 
rate heterogeneity within a gene, where the evolu- 
tionary rate can vary among different lineages of a 
tree. A homogeneously evolving gene evolves at the 
same rate per unit time among all branches of a tree, 





Figure 2 Relative evolutionary rates of the 36 protein 
quartets subjected to phylogenetic analysis. Protein 
quartets supporting tree A are shown in blue, those 
supporting tree B are shown in red, and the quartet 
supporting tree C is shown in green. The protein name 
abbreviation is shown along the y-axis, and the pro- 
teins are plotted in order of the mean evolutionary 
distance of nematode to human and arthropod to hu- 
man where nematode is C. elegans, and arthropod is D. 
melanogaster. Proteins with the highest number of 
pairwise substitutions (fast evolving) are at the top and 
those with the lowest number of pairwise substitutions 
(slow evolving) are at the bottom. The evolutionary 
distances along the x-axis were determined from 
amino acid alignments using a Poisson correction as 
described in Methods. The broken line (middle left) represents 
the midway point where 18 proteins are above 
the line and 18 are below the line. The key to the bars 
is shown. 
whereas a heterogeneously evolving gene may 
evolve more rapidly in some lineages than others, 
causing unequal rate effects that are known to pro- 
duce tree-building artifacts (Hillis et al. 1994; Lyons- 
Weiler and Hoelzer 1997; Maley and Marshall 
1998). For example, the 18S rRNA gene has evolved 
at a rapid rate of nucleotide substitution in C. el- 
egans compared to other animals (Winnepenninckx 
et al. 1995; Garey et al. 1996), causing unequal rate 
effects that artificially place C. elegans as a basal animal, 
as in tree A of Figure 1. However, 18S rRNA 
genes from most nonrhabditid nematode taxa appear 
to have evolved at a slower rate than in C. 
elegans (Blaxter et al. 1998), and when nematode 
18S rRNA sequences with lower substitution rates 
are analyzed, nematodes emerge as a sister taxon to 
arthropods (Aguinaldo et al. 1997), as in tree B of 
Figure 1. 

The finding that only one quartet supported 
tree C was expected because the possibility that ar- 
thropods are basal to both nematodes and humans 
is not supported by any hypothesis of metazoan 
phylogeny of which we are aware. The finding that 
24 quartets support tree A, whereas 11 quartets support 
tree B has several possible explanations. One is 
that tree A reflects the correct historical phylogeny, 
and that the quartets supporting trees B and C rep- 
resent random noise. An alternative explanation is 
that the finding of the majority of quartets supporting 
tree A is caused by unequal evolutionary rate effects, 
as in the 18S rRNA gene of C. elegans (Aguinaldo 
et al. 1997). To assess gene-to-gene variation 
and evolutionary rate heterogeneity among these 
36 orthologous proteins of humans, nematodes, 
and arthropods, the pairwise evolutionary distances 
from human to nematode were compared with 
those of human to arthropod for each quartet. 
These evolutionary distances are shown as a bar 
graph in Figure 2, with the quartets ordered from 
the fastest to slowest evolving proteins. There are 
fewer pairwise substitutions between arthropod and 
human than between nematode and human except 
in five quartets (EF2, SYB, METK, ATPase, PYR1). 
Figure 2 shows that 8 of 11 quartets that support 
tree B fall among the slower half of the quartets, 
whereas only three of the quartets supporting tree B 
fall among the faster half of the quartets. Five of the 
eight slowest evolving quartets (ATPB, CDC42, 
RL17, EF2, SERCA) support tree B. These five pro- 
teins have fewer overall substitutions, thus unequal 
evolutionary rates are less likely to develop and tree 
B is favored. The faster evolving quartets that support 
tree B (GPDA, RL22, MA12) likely represent 
genes that have a high number of substitutions, but 
where the number of substitutions is homogeneous 
between the taxa within the quartet. 

To visualize the degree of evolutionary rate heterogeneity 
among all four taxa, we plotted the four-way 
relative rates for the 36 proteins in Figure 3. The 
y-axis in Figure 3 displays the ratio of the evolution- 
ary rates of nematode and arthropod relative to 
yeast, which should equal one if a protein evolved 
homogeneously in the two lineages, as yeast is undoubtedly 
an outgroup to nematode and arthropod. 
However, most of the proteins have a y-value greater 
than one, indicating that they have evolved more 
quickly in the lineage from yeast to nematode than 





Figure 3 Four-way relative rate plot of evolutionary 
distances for 36 proteins. The ratio of evolutionary dis- 
tances from (human-nematode)/(human-arthropod) 
for each protein is plotted on the x-axis, where a ratio 
of 1 would be expected if the proteins were evolving 
homogeneously in those branches assuming that ar- 
thropods and nematodes are sister taxa. The ratios of 
evolutionary distance from (yeast-nematode)/(yeast- 
arthropod) are plotted on the y-axis, which should 
equal 1 if a protein evolved homogeneously in the 
nematode and arthropod lineages. The position where 
the x-axis and y-axis both equal one represents the 
region where genes would fall if they evolved homogeneously 
in all four taxa, if tree B is correct. Proteins 
to the right of the vertical line at x = 1 should favor tree 
A, proteins to the left should favor tree C, whereas 
proteins falling near the diagonal line should favor tree 
B. The distribution of the 36 orthologous proteins is 
skewed, with those that yield tree B (red squares) scat- 
tered uniformly around the diagonal line (with one ex- 
ception, CDC42 supports tree B but falls to the ex- 
treme right of the graph), whereas all of the proteins 
that yield tree A (blue diamonds) are scattered to the 
right of the diagonal. The quartet favoring tree C is 
shown in green (triangle). 
from yeast to arthropod. Similarly, the majority of 
the proteins have an x value >1, indicating that 
most have evolved more quickly in the lineage from 
human to nematode than from human to arthro- 
pod (assuming that nematodes and arthropods are 
sister taxa). The distribution of data points along 
the x-axis is highly skewed, with most of the proteins 
between 0.9 and 1.3, and the remainder with 
higher values. Of note, proteins supporting tree B 
are clustered more closely to the point where the x- 
and y-axes both equal one (representing proteins 
that evolved homogeneously in all four taxa), but 
proteins supporting tree A are all to the right with 
most far to the right (representing proteins that 
evolved more heterogeneously). Thus, these or- 
thologous proteins can be divided into two popula- 
tions: homogeneously evolving proteins that sup- 
port tree B, and heterogeneously evolving proteins 
that support tree A. This four-way relative rate plot 
suggests that the preponderance of proteins sup- 
porting tree A is largely attributable to unequal evo- 
lutionary rates. Given that the orthologous proteins 
selected for this study were extracted from total 
available genome data, it is reasonable to infer that 
approximately two-thirds of protein-coding genes 
in C. elegans have evolved more rapidly than in Dro- 
sophila. 
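
The two coordinates of the four-way relative rate plot are simple ratios of pairwise distances. The following minimal sketch shows how they could be computed for one quartet; the function name and the input distances are hypothetical, and the inputs are assumed to be corrected evolutionary distances of the kind described in Methods.

```python
def relative_rate_ratios(d_human_nematode, d_human_fly,
                         d_yeast_nematode, d_yeast_fly):
    """Return the (x, y) coordinates of one protein in a four-way
    relative rate plot.

    x = (human-nematode) / (human-arthropod) distance
    y = (yeast-nematode) / (yeast-arthropod) distance
    Both ratios are expected to be near 1 when the protein evolved at
    the same rate in the nematode and arthropod lineages.
    """
    x = d_human_nematode / d_human_fly
    y = d_yeast_nematode / d_yeast_fly
    return x, y


# Hypothetical distances for a protein evolving faster in the nematode.
x, y = relative_rate_ratios(0.75, 0.5, 1.2, 1.0)
print(x, y)
print("faster in nematode lineage:", x > 1 and y > 1)
```

Values of x and y both above 1 correspond to the upper right region of Figure 3, where the proteins supporting tree A are found.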

Highly Conserved Protein Domains and Diversity 
of Multidomain Proteins in Drosophila and C. elegans 

C. elegans and Drosophila, as the most highly developed 
genetically tractable model animals, are attractive 
systems for studying human disease pathways 
(Miklos and Rubin 1996; Ahringer 1997). In prac- 
tice, the ability to extract functional inferences de- 
pends more on the actual content and complexity 
of the protein repertoire in different model organ- 
isms than on their deduced taxonomic positions per 
se. To address the question of protein domain diver- 
sity in these two proteomes, we first masked puta- 



tive nonglobular segments and coiled coils in the 
Drosophila and C. elegans protein data sets. These 
two classes of sequences often serve as hinges be- 
tween globular domains and tend to produce spuri- 
ous hits with database searches (Altschul et al. 
1994). After deleting redundant domains in each 
database, we constructed Drosophila and C. elegans 
libraries of unmasked protein domains >50 amino 
acids and compared these libraries to each other, to 
the complete yeast proteome, and to the publicly 
available human ESTs. Matches with a similarity 
score of >90 were counted, a cutoff that virtually 
ensures evolutionary relation and functional rel- 
evance under the applied conditions (Koonin et al. 
1997; A.R. Mushegian, unpubl.). 

The results of this analysis are summarized in 
Table 2. The most conspicuous result is that the 
available portion of the Drosophila proteome is 
strongly enriched in conserved domains as com- 
pared to that of C. elegans. It seems unlikely that 
Drosophila has retained a larger fraction of proteins 
descended from an ancestral unicellular eukaryote, 
while also becoming enriched in protein domains 
shared with humans. Rather, we suspect that this 
difference is largely attributable to overrepresenta- 
tion of certain classes of Drosophila proteins in cur- 
rent databases, given that only a small fraction of 
the fly genome has been sequenced. A substantial 
increase in Drosophila genomic DNA sequence will 
clearly be required before the question of domain 
repertoire in this organism can be addressed in a 
more definitive way. 

A hallmark of eukaryotic genome evolution is 
the increased number of multidomain proteins 
thought to have originated largely by domain shuf- 
fling (Doolittle 1995). Because a comprehensive 
comparison of protein sets of C. elegans and D. me- 
lanogaster is limited by the extent of whole-genome 
sequencing, we wished to analyze a representative 
set of multidomain proteins in both available pro- 
teomes. Toward this end, we extended an earlier 



Table 2. Protein Domain Conservation in Model Organisms 

                                  Domains conserved in 
Species       No. of domains^a    C. elegans    Drosophila    yeast         human ESTs 
C. elegans    13169               100%          3133 (24%)    3741 (28%)    5305 (40%) 
Drosophila    2611                1905 (73%)    100%          1264 (48%)    2042 (78%) 

Only amino acid sequence similarities with the BLAST2 score higher than 90 are reported. 
^a The FASTA files of the domains in both model organisms are available on-line at http://www.sequana.com/publications/model_organisms. 
analysis of 77 proteins encoded by human positionally 
cloned genes specifically mutated in hereditary 
diseases, a set consisting largely of complex mul- 
tidomain proteins (Mushegian et al. 1997). Among 
84 proteins in the updated disease gene database 
(XREFdb as of July 1, 1997; see Methods), 68 (81%) 
shared similarity with nematode proteins, but in the 
majority of these cases the similarity was limited to 
individual domains within larger proteins with a 
different overall domain architecture. By our criteria 
of orthology, which requires similarity along the en- 
tire length of multidomain proteins and not just 
individual domains (see Methods), 25 human dis- 
ease proteins (37% of all detected similarities) had 
candidate orthologs in the sequenced portion of C. 
elegans proteome (Table 3). Of the 34 human pro- 
teins with similarity matches in the fly proteome, 
13 proteins (38%) had candidate orthologs in D. 
melanogaster. Thus, the likelihood of nematode and 
fly proteins possessing the same domain architec- 
ture as human disease gene products was remark- 
ably similar. 
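
The percentages quoted above follow directly from the counts in the text. A short check, using only numbers stated in this section (the helper name is hypothetical):

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Human disease proteins with any similarity to nematode proteins.
print(percent(68, 84))   # share of the 84 disease proteins
# Candidate orthologs among detected similarities.
print(percent(25, 68))   # C. elegans
print(percent(13, 34))   # D. melanogaster
```

The last two values differ by about one percentage point, which is the basis for describing the two proteomes as remarkably similar in this respect.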

The available portion of the fly proteome con- 
tains a high proportion of protein sequences ob- 
tained through positional cloning of phenotypi- 
cally significant genes as well as genes specifically 
cloned by homology to mammalian proteins. 
Therefore, one might expect that protein families 



would be unevenly represented among available 
Drosophila proteins. We addressed this issue by que- 
rying the Drosophila and C. elegans proteomes with 
additional sequences of biological interest. In one 
query using 30 human enzymes belonging to 4 dis- 
parate central metabolic pathways (Table 3), 24 had 
candidate orthologs in C. elegans and 18 in Dro- 
sophila. In another search, using a set of 151 human 
leukocyte surface (CD) antigens, 78 shared similar- 
ity with Drosophila sequences and 89 with C. elegans 
sequences, although not unexpectedly most of 
these similarity matches were to portable modules 
(such as immunoglobulin-like or epidermal growth 
factor (EGF)-like domains) within a nonorthologous 
protein. Interestingly, there appear to be three times 
as many orthologs of human CD antigens in C. el- 
egans as in Drosophila (Table 3). Inspection of the 
similarities showed that this difference in the number 
of orthologs is explained by the almost total 
absence of certain classes of proteins related to CD 
antigens among the available Drosophila sequences, 
including large metalloproteases, the type II (4TM) 
transmembrane receptors, aminopeptidases, and 
apyrases. 

Concluding Remarks 

In this study we evaluated the nematode C. elegans 
and the fruit fly D. melanogaster as model systems 
for studying human proteins using protein se- 
quence comparison techniques. By applying strict 
and reproducible criteria for identifying ortholo- 
gous proteins, we could extract numerous protein- 
coding genes for phylogenetic analysis. Our simul- 
taneous analysis of multiple orthologous proteins 
shows that different proteins can generate different 
apparent phylogenetic tree topologies, strongly sug- 
gesting that historical phylogenies should not be 
inferred based on a single protein-coding gene. Un- 
equal evolutionary rates are an important factor in 
calculating phylogenetic trees, and indeed it ap- 
pears that the majority of C. elegans genes are evolv- 
ing more rapidly than their Drosophila counterparts. 
The approaches of ortholog extraction used in this 
work can be used to better define data sets for phy- 
logenetic analysis among a broader range of repre- 
sentative animal phyla. The available portion of the 
fly proteome appears to be comparatively enriched 
in conserved protein domains because of abundant 
representation of phenotypically defined genes, 
while missing numerous protein families. The or- 
tholog-to-paralog ratio with regard to human pro- 
teins is very similar in the two model animals, in- 
dicating that the domain architecture in fly and 



Table 3. Ortholog Conservation in Model Invertebrate Animals 

                                                   Orthologs 
Data sets (no. of human proteins)                  C. elegans    D. melanogaster 
Positionally cloned genes mutated in specific 
  human diseases (84)                              25            13 
Biosynthetic enzymes^a (30)                        24            18 
  purine biosynthesis^b (6)                        5             4 
  arginine and proline biosynthesis (8)            6             6 
  sterol biosynthesis (11)                         8             5 
  folate biosynthesis (5)                          5             3 
Leukocyte surface antigens (151)                   15            5 

^a The list of proteins is available online at http://www.sequana.com/publications/model_organisms. 
^b Three human enzyme sequences in this category are unavailable, so the yeast orthologs were used. 

596#GENOME RESEARCH 






COMPARISON OF ORTHOLOGOUS PROTEINS IN MODEL ORGANISMS 



nematode proteins approximates that of their hu- 
man homologs to the same extent. 



METHODS 
Databases 

The NR database at the National Center for Biotechnology 
Information (Bethesda, MD) was used as the source of sequences 
and for most of the database searches. Species-specific 
sets of proteins were extracted from NR using Nentrez 
network tools and the species names as queries. Unfinished 
genome sequences from the C. elegans genome project (http://www.sanger.ac.uk/Projects/C_elegans) 
and database of C. elegans ESTs (http://www.ddbj.nig.ac.jp/c-elegans/html/CE_INDEX.html) 
were used to verify the protein sequences 
predicted from genomic DNA. The database of human disease 
genes and their homologs is available at http://www.ncbi.nlm.nih.gov/XREFdb, 
and partial data on orthologs in the 
nematode and yeast are at http://www.ncbi.nlm.nih.gov/Disease.Genes. 
A nonredundant list of human leukocyte surface 
(CD) antigens was constructed by modifying the list 
available at http://www.expasy.ch/cgi-bin/lists?cdlist.txt. Information 
on biochemical pathways was obtained in part 
from http://www.genome.ad.jp/kegg. 



Sequence Database Searching, Cross-Referencing, 
and Ortholog Identification 

Database searches were performed using the BLAST2 algo- 
rithm (Altschul and Gish 1996), with gap width 256 and no 
filtering. The BLASTP program was used to search protein da- 
tabases. The TBLASTN program was used to search the nucleo- 
tide sequence databases. In all searches, matches with simi- 
larity scores <75 were removed. This cutoff eliminates many 
spurious hits and virtually never eliminates orthologs for me- 
dium-sized proteins (A.R. Mushegian, unpubl.). To count the 
fraction of conserved and unique protein domains, a more 
restrictive similarity score cutoff, s > 90, was used. BLASTP 
results were automatically processed using the BLATAX program 
(Koonin et al. 1996) to extract the best matches in the 
given species. 

Two measures were applied to distinguish candidate or- 
thologs from likely paralogs based on sequence similarity, the 
BLASTP similarity score and the percentage of amino acid 
identity in the aligned segments. Criteria used to define can- 
didate orthologs (Tatusov et al. 1996) were as follows. First, 
protein A in proteome a is a candidate ortholog of protein B in 
proteome b, if protein B is the best match when sequence A is 
searched against proteome b, and, conversely, protein A is the 
best match when sequence B is searched against proteome a. 
Second, A and B share similarity along their whole lengths. 
Third, no homolog in a taxonomic outgroup (S. cerevisiae in 
the present analysis) is closer to A than B, or closer to B than 
A. Sequences that belong to large, diverged protein families 
were not considered because of limitations in applying the 
classic definition of orthology in such cases (Gehring et al. 
1994; Tatusov et al. 1997). The following families were thus 
excluded from analysis: protein kinases, protein phosphata- 
ses, RAS-like GTPases and their regulators, chaperones of the 
HSP60, HSP70, and HSP90 families, and RNA-binding pro- 
teins containing RNA recognition motifs. 
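
The first criterion of orthology (reciprocal best match) can be illustrated with a short sketch. It is not the BLATAX program; it assumes that similarity scores have already been parsed into dictionaries keyed by (query, subject), and all function and protein names are hypothetical. Only the score cutoff of 75 is taken from the text. The second and third criteria (whole-length similarity and the outgroup test) are not covered here.

```python
def best_hits(scores, min_score=75):
    """Return {query: best subject} for hits scoring at least min_score.

    scores maps (query, subject) to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if score < min_score:
            continue  # matches with scores below the cutoff are removed
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, min_score=75):
    """Return sorted (A, B) pairs that are each other's best match."""
    forward = best_hits(a_vs_b, min_score)
    reverse = best_hits(b_vs_a, min_score)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores: proteome a searched against b, and b against a.
a_vs_b = {("hA1", "fB1"): 200, ("hA1", "fB2"): 120,
          ("hA2", "fB2"): 90, ("hA3", "fB1"): 60}
b_vs_a = {("fB1", "hA1"): 198, ("fB2", "hA1"): 118, ("fB2", "hA2"): 88}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this example hA2 finds fB2 as its best match, but fB2 matches hA1 more strongly, so the pair is rejected as a likely paralog relationship.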



Phylogenetic Methods 

Amino acid sequence data sets for each of 42 protein-coding 
genes (see Results and Discussion) included orthologs as defined 
above from S. cerevisiae, C. elegans, D. melanogaster, and 
Homo sapiens. Orthologous protein quartets were aligned us- 
ing a star alignment procedure (Myers and Miller 1988), as 
implemented in Align Plus software version 3 (Scientific and 
Educational Software Co.) using the yeast sequence as a guide. 
This method was chosen because it does not invoke phylogenetic 
assumptions to carry out the alignment. Each quartet 
alignment was adjusted interactively using the MACAW program 
(Schuler et al. 1991) to correct alignment errors, and 
regions where amino acid similarity was too low to be certain 
of the alignment were deleted. The alignments are available at 
http://chuma.cas.usf.edu/~garey/alignments/alignment.html. 
Phylogenetic analysis was carried out only with quartet 
alignments where amino acid identity was >35% among all 
members. Sequence sites in an alignment with gaps in any 
single taxon sequence were excluded from phylogenetic 
analysis. Maximum parsimony trees were produced with the 
PHYLIP package (Felsenstein 1993). Evolutionary distances 
and neighbor-joining trees were calculated using both a Poisson 
distribution of amino acid substitutions and a γ correction 
(shape parameter = 2) using the MEGA program (Kumar 
et al. 1994). All trees were tested by the analysis of 100 boot- 
strap replicates. 
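
For readers unfamiliar with the two distance corrections named above, the sketch below gives the standard textbook formulas: the Poisson-corrected distance d = -ln(1 - p) and the gamma distance d = a[(1 - p)^(-1/a) - 1], where p is the proportion of differing sites and a is the shape parameter. It is an illustration under stated assumptions rather than the MEGA implementation; the function names and the toy sequences are hypothetical, and gaps are skipped pairwise here, whereas the analysis above excluded any site with a gap in any of the four taxa.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned sequences,
    ignoring columns where either sequence has a gap ('-')."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def poisson_distance(p):
    """Poisson-corrected number of substitutions per site."""
    return -math.log(1.0 - p)


def gamma_distance(p, shape=2.0):
    """Gamma-corrected distance; shape=2 matches the parameter in Methods."""
    return shape * ((1.0 - p) ** (-1.0 / shape) - 1.0)


# Toy alignment: five ungapped sites, one difference.
p = p_distance("ACDEFG", "ACDQF-")
print(p, poisson_distance(p), gamma_distance(p))
```

Both corrections give values larger than p, and the gamma distance exceeds the Poisson distance because it allows for rate variation among sites.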